Performance Evaluation and Comparative Analysis of Several Machine Learning Classification Techniques Using a Data-driven Approach in Predicting Renal Failure

Authors: Prawin R P, Pranav R P, Swathi R

DOI Link: https://doi.org/10.22214/ijraset.2023.54343

Abstract

Renal failure is characterized by the progressive loss of kidney function over time. It is a serious medical condition that affects millions of people worldwide and arises when the kidneys can no longer properly filter waste and excess fluids from the blood. Renal failure can be a consequence of chronic kidney disease, a long-term condition in which the kidneys gradually lose function. If chronic kidney disease is not adequately managed, kidney function may continue to decline, leading to renal failure; monitoring and managing chronic kidney disease is therefore essential to prevent renal failure from developing. This research paper presents an approach for predicting renal failure using several machine-learning classification techniques. The study evaluates the performance of classifiers such as Decision Tree, Naive Bayes, Extreme Gradient Boosting, Logistic Regression, and Support Vector Machines using evaluation metrics such as accuracy, precision, recall, and F1-score. The proposed method can be useful for early diagnosis and treatment of renal failure, thus reducing the complications and costs associated with the disease. By comparing and evaluating the performance of these models, we aim to identify the most effective approach for predicting renal failure and provide valuable insights for clinical practice.

I. INTRODUCTION

The human body has two kidneys located behind the peritoneal cavity; they are vital organs necessary for its proper functioning. The main function of the kidneys is to regulate the balance of salt, water, and other ions and trace elements in the human body, such as calcium, phosphorus, magnesium, potassium, chlorine, and acids. Data mining is the computer-assisted process of extracting useful information from very large data sets. [2] It is particularly useful for exploratory analysis because it draws meaningful information from large volumes of data. Clinical data mining has great potential for uncovering hidden patterns in the data sets of the clinical domain. Such data should be collected in a consistent, organized form.

This gathered data can then be used to build a clinical information system. [5] Data mining provides a user-oriented approach to discovering novel and hidden patterns in the data. [7] In this research paper, we present an approach for predicting renal failure using several machine-learning classification-based modeling approaches. This paper analyzes the prediction of renal failure using classification algorithms. [13] The study evaluates the performance of various classifiers such as Decision Tree, Naive Bayes, Extreme Gradient Boosting, Logistic Regression, and Support Vector Machines using various performance evaluation metrics. This proposed method can be useful for early diagnosis and treatment of renal failure. By contrasting and analyzing the performance of these different models, we determine the best approach for predicting renal failure and provide insightful information for clinical practice.

II. LITERATURE REVIEW

Gunarathne W.H.S.D et al. [1] compared different machine learning models and found that the Multiclass Decision Forest algorithm had the highest accuracy, approximately 99%, on a reduced dataset with 14 attributes. However, it is important to note that a model's accuracy may depend on various factors and the findings may not generalize to other datasets or contexts.

Salekin and Stankovic [3] utilized a novel machine-learning approach to detect Chronic Kidney Disease in a dataset of 400 records and 25 attributes. Their study employed k-nearest neighbors, random forest, and neural network algorithms, along with a wrapper method for feature reduction, and achieved high accuracy in detecting Chronic Kidney Disease. With this approach, the researchers demonstrated the potential for machine learning to improve Chronic Kidney Disease diagnosis and treatment.

Pinar Yildirim [4] investigated the impact of class imbalance on neural network algorithms for making medical decisions about Chronic Kidney Disease. The comparative study, conducted using sampling algorithms, demonstrated that sampling can enhance the performance of classification algorithms. Moreover, the research highlighted the critical role of the learning rate in the multilayer perceptron, which significantly affects its performance.

Guneet Kaur and Ajay Sharma [6] proposed a system for predicting Chronic Kidney Disease using Data Mining Algorithms in Hadoop. The study utilized two classifiers, KNN and SVM, and manually selected data columns for predictive analysis. The results indicated that SVM classifier outperformed KNN in accuracy, demonstrating the potential of this approach for Chronic Kidney Disease prediction.

Vasquez-Morales et al. [8] created a neural network model for predicting the risk of developing Chronic Kidney Disease, using a dataset of 40,000 instances. The accuracy of their model was reported as 95%.

Chen et al. [9] evaluated the performance of KNN, SVM, and SIMCA (Soft Independent Modelling of class Analogy) models for predicting the risk of Chronic Kidney Disease using a dataset from UCI. The SVM and KNN models achieved the highest accuracy of 99.7%, and SVM was found to be the most robust against noise disturbance.

Padmanaban and Parthiban [10] proposed the use of machine learning classifiers for early detection of Chronic Kidney Disease in diabetic patients. They collected data from a diabetes research center in Chennai and evaluated the performance of Naive Bayes and Decision tree algorithms using the Weka tool. Their study found that Naive Bayes classifier had the highest accuracy of 91%.

De Almeida et al. [11] conducted a study using Decision tree, Random Forest, and Support Vector Machine with various functions on the MIMIC-II database to predict Chronic Kidney Disease. They found that Decision tree and Random Forest had the highest accuracy, with prediction accuracies of 87% and 80%, respectively.

Deepika et al. [12] developed a Chronic Kidney Disease prediction project on a 24-attribute dataset using KNN and Naïve Bayes machine learning algorithms. The KNN algorithm achieved an accuracy of 97%, while the Naïve Bayes algorithm achieved an accuracy of 91%.

S. R. Raghavan, V. Ladik, and K. B. Meyer [14] suggested a decision support system called DARWIN, which is an intelligent software tool that assists doctors in determining the appropriate erythropoietin dosage for Chronic Kidney Disease patients. This system makes it simpler for doctors to calculate the dosage for thousands of patients within a month, which is a challenging task in the management of chronic kidney disease.

III. METHODOLOGY

A. Dataset

The dataset used here is taken from the UCI Machine Learning Repository, a collection of data sets widely used for evaluating machine learning algorithms. It is a real-world dataset containing 400 instances, each described by 25 clinical attributes. The attributes cover tests and conditions related to kidney disease, such as diabetes mellitus, hypertension, coronary artery disease, anemia, red blood cell count, white blood cell count, etc.

Table 1. List of Attributes in the Dataset

Attributes	Type
Age	Numeric
Blood Pressure	Numeric
Specific Gravity	Numeric
Albumin	Numeric
Sugar	Numeric
Red Blood Cells	Nominal
Pus Cell	Nominal
Pus Cell Clumps	Nominal
Bacteria	Nominal
Blood Glucose Random	Numeric
Blood Urea	Numeric
Serum Creatinine	Numeric
Sodium	Numeric
Potassium	Numeric
Hemoglobin	Numeric
Packed Cell Volume	Numeric
Red Blood Cell Count	Numeric
White Blood Cell Count	Numeric
Hypertension	Nominal
Diabetes Mellitus	Nominal
Coronary Artery Disease	Nominal
Appetite	Nominal
Pedal Edema	Nominal
Anemia	Nominal
Class	Class

B. Architecture Diagram

C. Pre-Processing

The main objective of preprocessing is to transform raw data into a format that can be easily used by machine learning algorithms. Through techniques such as data cleaning, handling missing values, and outlier detection, preprocessing can help to maximize the accuracy and effectiveness of machine learning algorithms. This allows the algorithms to identify patterns and relationships within the data, leading to more accurate and meaningful predictions and insights.

Data Cleaning: The dataset contains several missing values represented by question marks, which need to be replaced with NaN values. The dataset also contains some erroneous data that can be identified by performing data profiling and statistical analysis.
Handling Missing Values: The missing values in this dataset are handled by median and mode imputation. Since the numerical data is skewed, median imputation is used for numerical attributes. For categorical data, mode imputation is used.
Outlier Detection: Outliers are data points that are significantly different from other data points in the dataset. Outliers can negatively affect the accuracy of machine learning models. Therefore, it is essential to detect and handle outliers in this dataset. Outliers are detected using boxplots, Z-scores, or the interquartile range.
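
The three preprocessing steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the toy column values, and the 1.5 multiplier for the interquartile-range rule are all assumptions made for the example, and Python's None stands in for NaN.

```python
from statistics import median, mode, quantiles


def clean(values):
    """Replace the '?' placeholder with None so missing entries are explicit."""
    return [None if v == "?" else v for v in values]


def impute_numeric(values):
    """Fill missing numeric entries with the median of the observed entries."""
    observed = [float(v) for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else float(v) for v in values]


def impute_categorical(values):
    """Fill missing categorical entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical toy columns for illustration; not taken from the actual dataset.
serum_creatinine = clean(["1.2", "0.8", "?", "1.1", "15.0", "0.9", "1.0"])
appetite = clean(["good", "poor", "good", "?", "good"])

creatinine_filled = impute_numeric(serum_creatinine)
appetite_filled = impute_categorical(appetite)
creatinine_outliers = iqr_outliers(creatinine_filled)
```

In this toy example the median is barely affected by the extreme value 15.0, which is why median imputation suits skewed numerical attributes better than mean imputation would.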

D. Classification Models

Logistic Regression: Logistic regression is a statistical technique used to predict a binary outcome (i.e., the presence or absence of renal failure in this case) by fitting a logistic function to a set of input variables. It is a supervised learning algorithm that works by calculating the probability of an event occurring based on the values of the input variables. The logistic regression model estimates the odds ratio for each input variable such as diabetes mellitus, hypertension, coronary artery disease, anemia, red blood cell count, white blood cell count, etc. These odds ratios are used to make predictions about the outcome.
Naive Bayes: Naive Bayes is a probabilistic algorithm used for classification tasks, where it predicts the probability that a given instance belongs to a particular class. It works by calculating the conditional probability of each input variable given the class, under the assumption that the input variables are independent of one another given the class, and using Bayes' theorem to calculate the probability of the class given the input variables. In the context of predicting renal failure, Naive Bayes would estimate the probability of renal failure based on the values of input variables such as diabetes mellitus, hypertension, coronary artery disease, anemia, red blood cell count, white blood cell count, etc.
Support Vector Machine: Support Vector Machine is a powerful supervised learning algorithm used for classification tasks. Support Vector Machine works by finding the hyperplane that maximizes the margin between the two classes, with the aim of achieving the best separation between the data points. [15] Support Vector Machine is widely used in machine learning and has been successfully applied in various domains, including medical diagnosis. In the case of predicting renal failure, Support Vector Machine would use the input variables to find the hyperplane that best separates patients with renal failure from those without.
Extreme Gradient Boost: Extreme Gradient Boost is a gradient boosting algorithm that uses an ensemble of decision trees to make predictions. It works by sequentially adding decision trees to the model, with each new tree learning to correct the errors of the previous trees. Extreme Gradient Boost is known for its speed, scalability, and accuracy, and it has been used in various applications including predicting medical outcomes such as the presence of renal failure.
Decision Tree: A Decision Tree is a tree-like model used for making decisions or classifications. It works by splitting the data into branches, with each branch representing a decision based on a particular input variable. In the context of predicting renal failure, a decision tree would use input variables such as diabetes mellitus, hypertension, coronary artery disease, anemia, red blood cell count, white blood cell count, etc., to determine the presence or absence of renal failure. The tree structure of decision trees allows for easy interpretation and visualization of the decision-making process, making them useful for medical diagnosis and other applications.
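
To make two of the model descriptions above concrete, the sketch below shows how logistic regression turns a linear score into a probability (with exp(coefficient) as the odds ratio of each input variable) and how Naive Bayes combines a class prior with per-variable conditional probabilities. It is an illustration only: the function names, the class labels "ckd" and "notckd", and every number are hypothetical, and no fitted model from this study is reproduced.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Logistic regression: map the linear score (log-odds) to a probability."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratios(coefficients):
    """The odds ratio associated with each input variable is exp(coefficient)."""
    return [math.exp(c) for c in coefficients]


def naive_bayes_posterior(priors, likelihoods, features):
    """Naive Bayes: P(class | features) is proportional to
    P(class) * product of P(feature_i | class), normalised over the classes."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for name, value in features.items():
            score *= likelihoods[label][name][value]
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Hypothetical numbers chosen only for illustration.
# One binary input variable whose odds ratio is 3, present in this patient.
probability = logistic_probability(0.0, [math.log(3.0)], [1])

priors = {"ckd": 0.5, "notckd": 0.5}
likelihoods = {
    "ckd": {"hypertension": {"yes": 0.6, "no": 0.4}},
    "notckd": {"hypertension": {"yes": 0.2, "no": 0.8}},
}
posterior = naive_bayes_posterior(priors, likelihoods, {"hypertension": "yes"})
```

With these made-up inputs both models assign a probability of 0.75 to the positive class, which makes it easy to check the arithmetic by hand.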

IV. RESULTS AND DISCUSSIONS

A. Performance Evaluation

The prediction model must be evaluated to ensure that it fits the dataset and works well on unseen data. The aim of the performance evaluation is to estimate the generalization accuracy of a model on unseen (out-of-sample) data. Different performance evaluation metrics, including accuracy, precision, recall, and F1-score, have been computed. The confusion matrix helps with this by describing the performance of the classifier. A True Positive (TP) is an instance that the model predicts as positive and that actually belongs to the positive class. A True Negative (TN) is an instance that the model predicts as negative and that actually belongs to the negative class. A False Positive (FP) is an instance that the model predicts as positive but that actually belongs to the negative class. A False Negative (FN) is an instance that the model predicts as negative but that actually belongs to the positive class. The four measures mentioned above [16] are used to evaluate the performance of the binary classification models and provide a more comprehensive understanding of their accuracy and reliability.
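
The four metrics follow directly from the confusion-matrix counts, as the short sketch below shows. The function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

Precision penalizes false positives and recall penalizes false negatives; in a diagnostic setting a false negative (a missed case of renal failure) is usually the costlier error, so recall deserves attention alongside accuracy.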

V. FUTURE ENHANCEMENTS

Several directions remain for future work. The proposed method could be incorporated into clinical practice and its impact on patient outcomes and healthcare costs evaluated in a real-world setting. Real-time monitoring data from wearable devices could be incorporated to improve the early detection and diagnosis of renal failure. The method could be applied to predict renal failure in different populations and cultures to increase the generalizability of the findings. A web-based or mobile application could be developed to make the method more accessible to patients and healthcare providers. The method could be combined with other biomarkers to improve the early diagnosis of renal failure. Finally, more advanced feature selection methods could be incorporated to improve the interpretability of the models.

Conclusion

In conclusion, this research proposed a machine learning-based approach for predicting renal failure using several classification techniques and evaluated the classifiers using several performance evaluation metrics. Five machine learning algorithms were applied to the dataset. The highest accuracies were obtained with Extreme Gradient Boost, Decision Tree, and Naive Bayes: 98.75% for Extreme Gradient Boost and 97.50% for both Decision Tree and Naive Bayes. Logistic Regression and Support Vector Machine each reached 93.75%, the lowest accuracy among the five models. Extreme Gradient Boost also produced the highest F1-score values. These results suggest that the proposed method can support early diagnosis and treatment of renal failure, which may lead to better patient outcomes and reduced complications. The research provides insights into the most suitable technique for early diagnosis and treatment of renal failure, and the results of this study may serve as a basis for further research in this field.

References

[1] Gunarathne, W. H. S. D., Perera, K. D. M., & Kahandawaarachchi, K. A. D. C. P. (2017, October). Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD). In 2017 IEEE 17th international conference on bioinformatics and bioengineering (BIBE) (pp. 291-296). IEEE.
[2] Arasu, S. D., & Thirumalaiselvi, R. (2017). Review of chronic kidney disease based on data mining techniques. International Journal of Applied Engineering Research, 12(23), 13498-13505.
[3] Salekin, A., & Stankovic, J. (2016, October). Detection of chronic kidney disease and selecting important predictive attributes. In 2016 IEEE International Conference on Healthcare Informatics (ICHI) (pp. 262-270). IEEE.
[4] Yildirim, P. (2017, July). Chronic kidney disease prediction on imbalanced data by multilayer perceptron: Chronic kidney disease prediction. In 2017 IEEE 41st annual computer software and applications conference (COMPSAC) (Vol. 2, pp. 193-198). IEEE.
[5] Snegha, J., Tharani, V., Preetha, S. D., Charanya, R., & Bhavani, S. (2020, February). Chronic kidney disease prediction using data mining. In 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE) (pp. 1-5). IEEE.
[6] Kaur, G., & Sharma, A. (2017, November). Predict chronic kidney disease using data mining algorithms in hadoop. In 2017 international conference on inventive computing and informatics (ICICI) (pp. 973-979). IEEE.
[7] Qin, J., Chen, L., Liu, Y., Liu, C., Feng, C., & Chen, B. (2019). A machine learning methodology for diagnosing chronic kidney disease. IEEE Access, 8, 20991-21002.
[8] Vásquez-Morales, G. R., Martinez-Monterrubio, S. M., Moreno-Ger, P., & Recio-Garcia, J. A. (2019). Explainable prediction of chronic renal disease in the Colombian population using neural networks and case-based reasoning. IEEE Access, 7, 152900-152910.
[9] Chen, Z., Zhang, X., & Zhang, Z. (2016). Clinical risk assessment of patients with chronic kidney disease by using clinical data and multivariate models. International urology and nephrology, 48, 2069-2075.
[10] Padmanaban, K. A., & Parthiban, G. (2016). Applying machine learning techniques for predicting the risk of chronic kidney disease. Indian Journal of Science and Technology, 9(29), 1-6.
[11] De Almeida, K. L., Lessa, L., Peixoto, A., Gomes, R., & Celestino, J. (2020, January). Kidney failure detection using machine learning techniques. In 8th international workshop on advances in ICT infrastructures and services (ADVANCE 2020) (pp. 1-8).
[12] Deepika, B., Rao, V. K. R., Rampure, D. N., Prajwal, P., & Gowda, D. G. (2020). Early prediction of chronic kidney disease by using machine learning techniques. Amer. J. Comput. Sci. Eng. Survey, 8(2), 7.
[13] Shirahatti, A., Yadav, V., Vadde, N., Singh, A., Mahajan, P., & Sheikh, R. Prediction and preventive awareness of chronic kidney disease using machine learning algorithms.
[14] Raghavan, S. R., Ladik, V., & Meyer, K. B. (2005). Developing decision support for dialysis treatment of chronic kidney failure. IEEE Transactions on Information Technology in Biomedicine, 9(2), 229-238.
[15] Aprilianto, D. (2020). SVM optimization with correlation feature selection based binary particle swarm optimization for diagnosis of chronic kidney disease. Journal of Soft Computing Exploration, 1(1), 24-31.
[16] Abinaya, U., Devi, S. A., Haritha, B., & Raghunathan, T. (2021, May). Noval approach for chronic kidney disease using machine learning methodology. In Journal of Physics: Conference Series (Vol. 1916, No. 1, p. 012164). IOP Publishing.

Copyright

Copyright © 2023 Prawin R P, Pranav R P, Swathi R. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET54343

Publish Date : 2023-06-22

ISSN : 2321-9653

Publisher Name : IJRASET
